Wayback Machine URL Extractor - Archived URLs
Pricing
from $3.50 / 1,000 results
Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.
Developer
Logiover
Maintained by Community
Wayback Machine URL Extractor 🕰️ — Archived URLs from the Internet Archive
Recover every historical URL a website has ever published — straight from the Internet Archive's Wayback Machine. This Wayback Machine scraper queries the public CDX API to extract archived URLs and historical URLs for any domain — including pages that were deleted, renamed, or lost in a migration. Feed in one domain and get back up to tens of thousands of unique URLs, each with its capture date, archived HTTP status, MIME type, and a direct Wayback snapshot link.
Point it at one domain and it pulls the full historical URL inventory automatically. No API key, no login, no rate-limit headaches — one row per archived URL.
Looking to recover old URLs after a site migration, build a redirect map, find old/deleted pages, do OSINT on a domain's history, or pull a list of Internet Archive URLs without writing CDX queries by hand? This is the Internet Archive URL extractor that does it at scale.
✨ Key features
- 🕰️ Full historical URL inventory: pulls every unique URL the Wayback Machine has on record for a domain, going back to 1996.
- 🔑 No API key required: uses the open Internet Archive CDX API; no auth, no token, no login.
- 🌐 Subdomain & path matching: capture the host plus all subdomains and paths, or narrow down to a single host or path prefix.
- 📅 Date-range filtering: restrict to snapshots captured between two dates (`fromDate` / `toDate`).
- ✅ Status-code filtering: keep only `200 OK` captures and drop dead/redirected ones.
- 🔗 Direct snapshot links: every row includes a ready-to-open `web.archive.org/web/...` URL.
- 🌊 Streamed pagination: pages through massive result sets with the CDX `resumeKey` mechanism, so memory stays flat even on 100k+ URL domains.
- 🔢 Result caps: set `maxResults` per domain, or `0` for unlimited.
- 📋 Multiple domains per run: process a whole list in one go.
- 📤 Export-ready: JSON, CSV, and Excel output via the Apify Dataset or REST API.
💡 Use cases
- SEO migration & redirect maps — recover lost/old URLs after a site move and rebuild a complete 301 redirect map so you don't lose link equity.
- Content recovery — find and restore blog posts, product pages, or docs that were deleted but still live in the archive.
- OSINT & research — enumerate a target domain's historical footprint, old endpoints, removed pages, and forgotten subdomains.
- Link reclamation — find old URLs that still earn backlinks so you can redirect them and reclaim the link value.
- Finding old endpoints — surface admin paths, legacy APIs, and orphaned pages that no longer appear on the live site.
- Competitive & web-archaeology research — reconstruct how a competitor's URL structure and content changed across years of snapshots.
- Datasets — build a domain's URL/MIME/capture-history dataset for analysis.
📦 What you get
One row per unique archived URL, including:
| Field | Description |
|---|---|
| `domain` | The normalized domain this URL belongs to |
| `url` | The original archived URL |
| `timestamp` | Raw 14-digit Wayback capture timestamp (`YYYYMMDDhhmmss`) |
| `capturedAt` | ISO 8601 form of the capture timestamp |
| `statusCode` | HTTP status the archive recorded for that capture (e.g. `200`, `301`, `404`, or `-`) |
| `mimeType` | Content type recorded at capture time (e.g. `text/html`) |
| `digest` | Wayback content digest (used internally for de-duplication) |
| `snapshotUrl` | Direct link to the archived snapshot on web.archive.org |
Example output
    {
      "domain": "nasa.gov",
      "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
      "timestamp": "20120114043915",
      "capturedAt": "2012-01-14T04:39:15.000Z",
      "statusCode": "200",
      "mimeType": "text/html",
      "digest": "AB23CD45EF67GH89IJ01KL23MN45OP67",
      "snapshotUrl": "https://web.archive.org/web/20120114043915/http://www.nasa.gov/mission_pages/station/main/index.html"
    }
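The relationship between `timestamp`, `capturedAt` and `snapshotUrl` in the example row can be sketched in a few lines of Python. This is an illustration of the field formats only, not the actor's own code; the function names `to_captured_at` and `to_snapshot_url` are made up for this example.

```python
from datetime import datetime, timezone


def to_captured_at(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to ISO 8601 UTC."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # The example row uses millisecond precision and a trailing "Z".
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def to_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the web.archive.org link for one capture of one URL."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


row_timestamp = "20120114043915"
row_url = "http://www.nasa.gov/mission_pages/station/main/index.html"
captured_at = to_captured_at(row_timestamp)
snapshot_url = to_snapshot_url(row_timestamp, row_url)
```

Applied to the example row, both helpers reproduce the `capturedAt` and `snapshotUrl` values shown above.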
🚀 How to use it
- Click Try for free / Start.
- Add one or more domains to Domains (e.g. `nasa.gov`, `bbc.com`). URLs and `www.` are normalized automatically.
- (Optional) Pick a matchType, set a date range, filter by status code, or raise maxResults (`0` = unlimited).
- Click Save & Start.
- Export the archived URL list as JSON, CSV, Excel or via API, and open any row's `snapshotUrl` to view the archived page.
⚙️ Input
| Field | Type | Description | Default |
|---|---|---|---|
| `domains` | array | Required. One or more domains or URLs (e.g. `nasa.gov`, `bbc.com`). Wildcards added automatically | – |
| `matchType` | enum | `subdomains` (host + all subdomains + paths), `host` (exact host only), `domain` (host + subdomains), `prefix` (path prefix) | `subdomains` |
| `fromDate` | string | Optional `YYYYMMDD` lower bound on capture date | – |
| `toDate` | string | Optional `YYYYMMDD` upper bound on capture date | – |
| `filterStatus` | string | Optional. Only return captures with this HTTP status (e.g. `200`) | – |
| `maxResults` | integer | Max unique URLs per domain. `0` = unlimited | `5000` |
| `proxyConfiguration` | object | Proxy settings. Defaults to Apify Proxy | Apify Proxy |
Example input
    {
      "domains": ["nasa.gov"],
      "matchType": "subdomains",
      "fromDate": "20100101",
      "toDate": "20201231",
      "filterStatus": "200",
      "maxResults": 5000,
      "proxyConfiguration": { "useApifyProxy": true }
    }
🔍 How it works
- Each domain you provide is normalized: scheme, `www.`, paths and wildcards are stripped down to a bare host.
- A CDX API query is built from your `matchType`, date range, and status filter, requesting the `original`, `timestamp`, `statuscode`, `mimetype` and `digest` fields with `collapse=urlkey` so each URL appears only once instead of returning every capture of it.
- Results are paged using the CDX `showResumeKey` / `resumeKey` mechanism, and each page is pushed to the dataset in a batch, so even domains with hundreds of thousands of archived URLs stream out without exhausting memory.
- For every row, a direct `snapshotUrl` is constructed in the `https://web.archive.org/web/<timestamp>/<original-url>` form, so you can open the exact archived page.
- Slow responses, `5xx`, and `429` errors are retried with exponential backoff on a fresh proxy IP. The CDX index can be slow, so retries keep large runs reliable.
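The first two steps above can be sketched roughly as follows. This is a minimal Python illustration, not the actor's source: `normalize_domain` and `build_cdx_query` are hypothetical names, the `page_size` value is arbitrary, and the function takes a CDX match type (`exact`, `prefix`, `host`, `domain`) directly because the mapping from the actor's own `matchType` values is not documented here. The query parameters themselves (`url`, `matchType`, `from`, `to`, `filter`, `fl`, `collapse`, `showResumeKey`, `resumeKey`, `limit`, `output`) are those of the public CDX server.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def normalize_domain(raw: str) -> str:
    """Reduce a domain or URL to a bare host without scheme, www., path or wildcard."""
    value = raw.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlsplit read the text as a network location
    host = urlsplit(value).hostname or ""
    host = host.lstrip("*.")  # drop a leading wildcard such as "*."
    if host.startswith("www."):
        host = host[4:]
    return host


def build_cdx_query(
    host: str,
    match_type: str = "domain",
    from_date: Optional[str] = None,
    to_date: Optional[str] = None,
    filter_status: Optional[str] = None,
    resume_key: Optional[str] = None,
    page_size: int = 5000,  # assumed page size, not taken from the actor
) -> str:
    """Build one CDX request URL that returns each urlkey at most once."""
    params = {
        "url": host,
        "matchType": match_type,
        "fl": "original,timestamp,statuscode,mimetype,digest",
        "collapse": "urlkey",
        "showResumeKey": "true",
        "limit": str(page_size),
        "output": "json",
    }
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    if filter_status:
        params["filter"] = f"statuscode:{filter_status}"
    if resume_key:
        params["resumeKey"] = resume_key
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


query_url = build_cdx_query(
    normalize_domain("https://www.nasa.gov/mission_pages/"),
    from_date="20100101",
    to_date="20201231",
    filter_status="200",
)
```

Keeping the URL builder free of network calls makes it easy to inspect the exact request before spending a run on it.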
🧰 Tips & best practices
- Big domains (news sites, government sites) can have hundreds of thousands of archived URLs. Start with the default `maxResults` of `5000` to gauge volume, then raise it or set `0` for everything.
- Use `filterStatus: "200"` to skip dead and redirected captures and keep only pages that actually resolved, which is ideal for building redirect maps.
- Narrow with `fromDate` / `toDate` (both `YYYYMMDD`) when you only care about a specific era of the site.
- Use `matchType: "subdomains"` to sweep every subdomain at once, or `host` for a single host without its subdomains.
- Sort or filter the dataset by `mimeType` to isolate just HTML pages, images, PDFs, etc.
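As an illustration of the last tip, exported rows can be grouped by `mimeType` after download. The sketch below assumes the rows have already been loaded as a list of dictionaries with the output fields described earlier; `group_by_mime` is a hypothetical helper name and the sample rows are invented.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_mime(rows: List[dict]) -> Dict[str, List[str]]:
    """Map each MIME type to the archived URLs recorded with it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        groups[row.get("mimeType", "unknown")].append(row["url"])
    return dict(groups)


# Invented sample rows, shaped like the actor's output.
sample_rows = [
    {"url": "http://example.com/", "mimeType": "text/html"},
    {"url": "http://example.com/report.pdf", "mimeType": "application/pdf"},
    {"url": "http://example.com/about", "mimeType": "text/html"},
]
by_mime = group_by_mime(sample_rows)
html_pages = by_mime.get("text/html", [])
```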
❓ FAQ
How do I get all URLs of a website from the Wayback Machine?
Add the domain to Domains, leave matchType on subdomains, set maxResults to 0 for everything, and run it. The actor queries the Internet Archive CDX API and returns one row per unique archived URL.
Can I find old or deleted pages of a domain?
Yes — that's the core use case. The Wayback Machine keeps URLs even after they're removed from the live site, so deleted blog posts, retired product pages, and old endpoints all show up in the results with a snapshotUrl to view them.
How do I export archived URLs to CSV or JSON?
Run the actor, then download the dataset as CSV, JSON or Excel (or pull it via the REST API). Every archived URL is one row, so it drops straight into a spreadsheet or pipeline.
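For pipelines, the download link can be built against the Apify dataset items endpoint. A small sketch in Python, assuming you already have the run's dataset ID; `dataset_export_url` is a hypothetical helper and `abc123` is a placeholder ID.

```python
from typing import Optional
from urllib.parse import urlencode


def dataset_export_url(dataset_id: str, fmt: str = "csv", token: Optional[str] = None) -> str:
    """Build the Apify API URL that returns a dataset's items in the given format."""
    params = {"format": fmt}  # e.g. "csv", "json" or "xlsx"
    if token:
        params["token"] = token  # only needed when the dataset is not publicly readable
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


csv_url = dataset_export_url("abc123")  # placeholder dataset ID
json_url = dataset_export_url("abc123", fmt="json")
```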
Is this free and without an API key?
The Internet Archive CDX API is public and requires no API key and no login. You only pay for the Apify platform usage of the run itself.
Can I filter by date or status code?
Yes — set fromDate / toDate (YYYYMMDD) to restrict to a capture window, and filterStatus (e.g. 200) to keep only captures with a specific HTTP status.
How many URLs can it return?
By default, up to 5,000 unique URLs per domain; raise maxResults, or set it to 0 for unlimited. Results stream to the dataset in pages via the CDX resumeKey, so even 100k+ URL domains run without memory issues.
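The resume-key paging can be sketched as a generator that holds only one page in memory at a time. This Python example is an assumption-laden illustration, not the actor's code: it assumes `output=json` responses where the first entry is the header row and, when more results remain, the last two entries are an empty list followed by a one-element list holding the resume key. `split_cdx_page`, `iterate_rows` and the injected `fetch_page` callable are hypothetical names.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def split_cdx_page(payload: List[list]) -> Tuple[List[dict], Optional[str]]:
    """Split one JSON page into row dictionaries and an optional resume key."""
    if not payload:
        return [], None
    header, body = payload[0], payload[1:]
    resume_key = None
    # Assumed layout: [..., [], ["<resumeKey>"]] when further pages exist.
    if len(body) >= 2 and body[-2] == [] and len(body[-1]) == 1:
        resume_key = body[-1][0]
        body = body[:-2]
    return [dict(zip(header, fields)) for fields in body], resume_key


def iterate_rows(
    fetch_page: Callable[[Optional[str]], List[list]],
    max_results: int = 0,
) -> Iterator[dict]:
    """Yield rows page by page; max_results=0 means no cap."""
    emitted = 0
    resume_key = None
    while True:
        rows, resume_key = split_cdx_page(fetch_page(resume_key))
        for row in rows:
            yield row
            emitted += 1
            if max_results and emitted >= max_results:
                return
        if resume_key is None:
            return


# Two fake pages stand in for real CDX responses.
fake_pages = {
    None: [["original", "timestamp"], ["http://example.com/", "20200101000000"], [], ["KEY1"]],
    "KEY1": [["original", "timestamp"], ["http://example.com/a", "20200102000000"]],
}
all_rows = list(iterate_rows(lambda key: fake_pages[key]))
capped_rows = list(iterate_rows(lambda key: fake_pages[key], max_results=1))
```

Passing the fetch function in as a parameter keeps the paging logic testable without touching the network; a real `fetch_page` would request the CDX URL with the given `resumeKey` and decode the JSON body.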
Why are some statusCode values -?
The Wayback index sometimes records captures without a stored status code (e.g. revisit records). Those rows are still valid archived URLs.
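When post-processing the dataset, it helps to treat `-` as "no status recorded" rather than as a number. A minimal Python sketch, with `parse_status` as a hypothetical helper and invented sample rows:

```python
from typing import Optional


def parse_status(status_code: str) -> Optional[int]:
    """Return the status as an int, or None when the archive stored no status."""
    return int(status_code) if status_code.isdigit() else None


# Invented sample rows, shaped like the actor's output.
sample_rows = [
    {"url": "http://example.com/", "statusCode": "200"},
    {"url": "http://example.com/old", "statusCode": "301"},
    {"url": "http://example.com/revisit", "statusCode": "-"},
]
resolved = [row["url"] for row in sample_rows if parse_status(row["statusCode"]) == 200]
no_status = [row["url"] for row in sample_rows if parse_status(row["statusCode"]) is None]
```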
🔗 Related actors by the same author
- Sitemap to URL Crawler — extract all URLs from any sitemap.xml.
- Website SEO Audit Crawler — run a full on-page SEO audit across a whole site.
- Bulk URL Status Checker — check HTTP status codes for a list of URLs in bulk.
- Broken Link Checker — crawl a site and find dead links with HTTP status codes.
📝 Changelog
2026-06-15
- Initial release — extract archived URLs from the Wayback Machine CDX API with date/status filters, CSV/JSON export, no API key.